Abstract

Background: CRISPR (Clustered Regularly Interspaced Short Palindromic Repeats) is a prokaryotic adaptive defence 
system that provides resistance against alien replicons such as viruses and plasmids. Spacers in a CRISPR cassette 
confer immunity against viruses and plasmids containing regions complementary to the spacers and hence they 
retain a footprint of interactions between prokaryotes and their viruses in individual strains and ecosystems. The 
human gut is a rich habitat populated by numerous microorganisms, but a large fraction of these are unculturable 
and little is known about them in general and their CRISPR systems in particular. 

Results: We used human gut metagenomic data from three open projects in order to characterize the composition 
and dynamics of CRISPR cassettes in the human-associated microbiota. Applying available CRISPR-identification 
algorithms and a previously designed filtering procedure to the assembled human gut metagenomic contigs, we found 
388 CRISPR cassettes, 373 of which had repeats not observed previously in complete genomes or other datasets. Only 
171 of 3,545 identified spacers were coupled with protospacers from the human gut metagenomic contigs. The number
of matches to GenBank sequences was negligible, providing protospacers for 26 spacers. 
Reconstruction of CRISPR cassettes allowed us to track the dynamics of spacer content. In agreement with other 
published observations we show that spacers shared by different cassettes (and hence likely older ones) tend to be located at the
trailer ends, whereas spacers with matches in the metagenomes are distributed unevenly across cassettes,
demonstrating a preference to form clusters closer to the active end of a CRISPR cassette, adjacent to the leader, and 
hence suggesting dynamical interactions between prokaryotes and viruses in the human gut. Remarkably, spacers 
match protospacers in the metagenome of the same individual with frequency comparable to a random control, but 
may match protospacers from metagenomes of other individuals. 

Conclusions: The analysis of assembled contigs is complementary to the approach based on the analysis of original 
reads and hence provides additional data about composition and evolution of CRISPR cassettes, revealing the dynamics 
of CRISPR-phage interactions in metagenomes. 
Background

Prokaryotic cells inhabiting the human body outnumber 
its own eukaryotic cells at least ten to one, with the over- 
whelming majority of bacteria residing in the intestine. 
This complex community of symbiotic, pathogenic and 
commensal microorganisms is called the microbiome [1]. 
The human gut microbiome might be considered as an 
organ within an organ [2]. It has been shown to be indispensable
for human life as it is capable of vitamin
production [3,4], digestion of complex polysaccharides [5], 
controlling intestinal epithelial cell proliferation through 
the production of short-chain fatty acids [6], and influen- 
cing the normal development and function of the mucosal 
immune system [7]. While bacteria are responsible for these 
functions, bacteriophages, in turn, influence their abun- 
dance in the human gut [8,9]. 

As a reaction to an ongoing phage pressure, prokaryotes 
have developed numerous defence mechanisms [10]. The 
CRISPR systems are especially interesting, as they retain 
the history of interactions between viruses and their 
prokaryotic hosts [11-13]. Despite being highly diverse
[14], they typically comprise a CRISPR cassette
containing an array of unique spacer sequences (25-70 bp)
alternating with conserved short direct repeats and preceded
by a 5'-leader sequence of 200-500 bp. The systems
also include numerous CRISPR-associated (cas) genes that
encode proteins performing various, only partially characterized
functions essential for the system's activity [15]. A
CRISPR cassette is transcribed as a long precursor RNA 
molecule that is further cleaved into small fragments 
(crRNA), each containing one complete spacer. crRNAs 
are the main players in the RNA-guided degradation of 
foreign replicons [11,16,17]. Accumulation of new spacers 
occurs at one side of a cassette, adjacent to the leader sequence,
while internal spacers may be deleted by recombination.
Hence older spacers are shifted to the 3'-end,
and cassettes retain a unique chronological footprint of vi- 
ruses that have infected a given strain [18-20]. However, a 
part of a cassette or even a complete cassette may be also 
acquired via horizontal gene transfer [21]. In addition, 
identical or similar repeats in different CRISPR cassettes 
may indicate their common ancestry. 

Up to 60% of all microorganisms that inhabit the human 
body are considered to be unculturable [22]. Culture- 
independent metagenomics is the most powerful approach 
to study the composition and dynamics of complex micro- 
bial communities. The metagenomics data allow one to 
obtain a complete snapshot of coexisting microorganisms, 
both prokaryotes and their viruses. 

To date, CRISPR systems have been analyzed in meta- 
genomic datasets of several environments including acido- 
philic biofilms [23], acidic hot springs in Yellowstone 
National Park [24], Australian hypersaline Lake Tyrrell [25],
ocean metagenome produced by the Global Ocean Sampling 
(GOS) expedition [26], and the rumen microbiome [27]. 

A considerable effort is directed to large-scale investiga- 
tion of the human microbiome using the metagenomic 
approach. The main aim of these studies is to understand 
the role of the endogenous flora in health and disease. 
Among all body sites, the diversity of microorganisms in 
the human gut is known to be the highest [28]. The data 
from several human microbiome metagenomic projects 
are available, as well as human gut virome data [29-34]. A 
high level of microbial diversity and availability of meta- 
genomic datasets obtained using various sequencing tech- 
niques make the human gut microbiome a promising 
object for studying CRISPR systems. 

Indeed, CRISPR cassettes were characterized across 
body sites in different individuals through independent 
projects [35-37] and as a part of the Human Microbiome 
Project [38] with a particular attention to the gut meta- 
genome [38-40]. In these studies, raw reads containing 
CRISPR repeats were collected, followed either by the 
analysis of the spacer content [39] or reassembly of
repeat-containing reads into contigs [38]. This approach
allowed the authors to identify thousands of spacers, al- 
though it was limited to CRISPR cassettes with already 
known repeats. While being a powerful tool to study the 
distribution of spacers, this strategy does not account for 
the CRISPR cassette structure, and hence may not track 
the evolutionary dynamics of spacers within cassettes. 
To offset this, we identified cassettes in assembled con- 
tigs. The comparison of spacers and repeats of these cas- 
settes with previously analyzed spacers and repeats, in 
particular, those identified by read-based techniques, 
yielded only few matches. This suggests that both ap- 
proaches are useful as they produce complementary 
findings. We analyzed the CRISPR content in three hu- 
man gut metagenomes, two of which have not been ana- 
lyzed earlier in this context. We identified CRISPR 
cassettes, compared the sets of repeats and spacers with 
the ones identified in earlier studies and analyzed the 
differences, identified protospacers, reconstructed the 
taxonomy distribution of cassettes and protospacers, 
characterized the distribution of spacers and protospa- 
cers in individual metagenomes, and, finally, described 
the dynamics of spacer positions within CRISPR cas- 
settes for different classes of spacers. 

Methods 

Metagenomic datasets 
Human microbiomes 

The gut samples of the Human Microbiome Project 
(HMP) dataset were downloaded as an assembly in 
1,889,651 contigs [41]. The total length of HMP contigs
was 3,732 Mb. The fecal DNA samples were collected
from 124 adults of various ages (18-69) and sequenced
on Illumina GA machines [28].

The assembled metagenomic dataset from 13 healthy 
Japanese individuals (JPN) was downloaded from the 
CAMERA website [42]. This dataset contained 353,805 
contigs of the total length 463 Mb. The samples were col- 
lected from adults and children including unweaned in- 
fants (6 months to 45 years), comprising two families of 
three and four members, and six unrelated individuals. 
The shotgun reads were obtained using MegaBACE4500 
sequencers (GE Healthcare) [43]. 

The contigs from the Distal Gut metagenomic project 
(DG) were downloaded from the NCBI website [44]. The 
assembly contained 22,508 contigs, comprising 336 Mb. 
The reads were sequenced using the ABI 3730xl DNA
analyzer [45]. 

The information about metagenomic datasets used here 
is summarized in Table 1. 

Identification and analysis of CRISPR cassettes 

To construct a set of CRISPR cassettes for each metagenomic
dataset, we used three algorithms, PILER-CR [46],
the CRISPR Recognition Tool (CRT) [47], CRISPRFinder
[48], and a previously designed filtering procedure [26].

Table 1 Characteristics of the analyzed metagenomic datasets

| Metagenomic project | Number of contigs/reads | Total length | Source | Individuals involved | Sequencing platform | Assembly algorithm |
|---|---|---|---|---|---|---|
| The Human Microbiome Project (HMP) | 1,889,651 contigs | 3,732 Mb (N50 = 3692) | Fecal samples | 124 Europeans of various ages (18-69) | Illumina GA | MetaMos |
| Healthy Human Gut Metagenomes (JPN) | 353,805 contigs | 463 Mb (N50 = 1180) | Fecal samples | 13 Japanese individuals (6 months - 45 years), comprising 2 families (of 3 and 4 members) and 6 unrelated individuals | MegaBACE4500 sequencer (GE Healthcare) | PCAP |
| Distal gut metagenomic project (DG) | 22,508 contigs | 336 Mb (N50 = 1657) | Fecal samples | 2 healthy adults | ABI 3730xl DNA analyzer | Celera Assembler |

In addition, we attempted to use Crass [49] with default
parameters to assemble CRISPR cassettes from
metagenomic reads. While the number of detected cassettes
for the DG read dataset was comparable to that
obtained by our procedure, Crass did not assemble
CRISPR cassettes from the HMP data. Relaxing parameters
for the number of repeats and spacer or repeat
length (-n 2 -w 6 -s 20 -S 55) allowed Crass to identify
only one CRISPR cassette in the HMP dataset. For the
JPN dataset metagenomic reads were not available, and
hence Crass could not be applied. Hence, for uniformity,
Crass predictions were not considered further.

To determine contig taxonomy, contigs were subjected 
to the BLASTX search [50,51] against the non-redundant 
protein collection (NR) of GenBank [52] (e-value threshold 
le"^). Taxonomic labels were assigned manually based on 
the degree of consistency in the taxonomy origin of the top 
hits. Taxonomic labels at the phylum level were assigned if 
at least the top ten hits belonged to one phylum; taxonomic labels
at the level of class, family, and genus were assigned if
the majority of top 30 hits belonged to the same taxon of 
that level. If top hits were taxonomically diverse, the contig 
was assigned with a nonspecific taxonomic label. A contig 
might not be assigned with a taxonomic label for the fol- 
lowing reasons: (1) CRISPR cassette covering entire contig 
length; (2) CRISPR cassette flanked by regions containing 
only universal cas genes, known to be subject to frequent 
horizontal gene transfer, so that their phylogeny does not 
necessarily reflect taxonomy [53]; (3) flanking regions con- 
taining genes with no significant similarity to any entry in 
the non-redundant GenBank collection. The cas genes 
were identified as described previously [26]. 
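
The assignment rule described above can be sketched in code. This is only an illustration: the labels in this study were assigned manually, and the function name `assign_labels`, the input layout (one dictionary per BLASTX hit, ordered from best to worst), and the reading of "majority" as more than half of the top 30 hits are assumptions made here.

```python
from collections import Counter

RANKS_BY_MAJORITY = ("class", "family", "genus")

def assign_labels(hits, phylum_top=10, majority_top=30):
    """Return {rank: taxon} for one contig (hypothetical helper).

    `hits` is a list of dicts ordered from best to worst hit, e.g.
    {"phylum": "Firmicutes", "class": "Clostridia"}.
    """
    labels = {}
    top = hits[:phylum_top]
    # Phylum: all of the top ten hits must belong to one phylum.
    if len(top) == phylum_top:
        phyla = {h.get("phylum") for h in top}
        if len(phyla) == 1 and None not in phyla:
            labels["phylum"] = phyla.pop()
    # Class, family, genus: more than half of the top 30 hits must agree
    # (assumed interpretation of "majority").
    window = hits[:majority_top]
    for rank in RANKS_BY_MAJORITY:
        counts = Counter(h[rank] for h in window if h.get(rank))
        if counts:
            taxon, n = counts.most_common(1)[0]
            if n > len(window) / 2:
                labels[rank] = taxon
    return labels
```

A contig for which the function returns an empty dictionary would correspond to the nonspecific label mentioned above.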

We considered two types of data to perform the 
BLASTN search in order to identify sources of spacers 
(protospacers). Firstly, we compared spacer sequences to all 
viral entries, including complete genomes, from GenBank. 
Secondly, we compared the spacer sets with the human 
metagenomic datasets themselves, assuming that these data 
may still contain contigs of phage, prophage, or plasmid
origin after filtering out small particles (according to the
metagenome DNA isolation protocol [54]). 

If a mismatch between two similar sequences is located 
at a distance less than one word from the sequence end, 
BLASTN would not extend the alignment over this 
mismatch. Since the alignments between spacers and 
protospacers are necessarily short, this means that 
spacer-protospacer pairs with mismatches in the middle 
might be aligned by BLASTN only partially. To offset 
this and to estimate the real number of mismatches in 
identified spacer-protospacer pairs, all obtained hits 
were postprocessed, and if the observed matching re- 
gions were shorter than the corresponding spacer se- 
quence they were extended in one or both directions to 
match the full length of the spacer. 

The number of mismatches was calculated for extended 
alignments of spacer-protospacer pairs and a threshold of 
four mismatches along the entire spacer length was set to 
define candidate protospacer sequences. To ensure that a 
sequence matched by a spacer is not an undetected CRISPR 
cassette, we performed a parallel BLASTN search for repeat 
sequences from the corresponding CRISPR cassettes against 
the same datasets as it was done for spacer sequences. 
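
The post-processing of partial BLASTN hits can be illustrated as follows. This is a minimal sketch assuming an ungapped extension on the same strand; the function names and the 0-based coordinate convention are hypothetical, and "a threshold of four mismatches" is read here as "at most four".

```python
def extend_and_count(spacer, target, q_start, t_start):
    """Extend a partial hit to the full spacer and count mismatches.

    q_start and t_start are 0-based start positions of the reported hit
    in the spacer and in the target sequence (same strand assumed).
    Returns the mismatch count, or None if the full-length spacer would
    run past either end of the target.
    """
    offset = t_start - q_start  # target position aligned to spacer[0]
    if offset < 0 or offset + len(spacer) > len(target):
        return None
    window = target[offset:offset + len(spacer)]
    return sum(a != b for a, b in zip(spacer, window))

def is_candidate_protospacer(spacer, target, q_start, t_start, max_mismatches=4):
    """Apply the mismatch threshold to the extended alignment."""
    n = extend_and_count(spacer, target, q_start, t_start)
    return n is not None and n <= max_mismatches
```

Counting mismatches over the whole spacer, rather than over the reported hit only, is what makes a mismatch near the spacer end visible.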

The taxonomic labels were assigned to protospacer- 
containing contigs as described above, and then transferred 
to the respective spacers as follows: if the protospacer was 
of a phage or plasmid origin, we used the taxonomical 
information about its host, whereas if the protospacer 
came from a sequence of a bacterial origin, its taxonomy 
was assigned as described above. If the spacer was 
already assigned with a taxonomical label, the assign- 
ments were compared. 

In order to estimate the significance of the observed 
similarities between spacers and a sequence database, we 
generated randomized sets of "pseudospacers", where each 
spacer was replaced by a random fragment of the same 
length and, if possible, from the same contig. The range 
for randomization excluded regions covered by CRISPR 
cassettes. If a CRISPR cassette covered a metagenomic 
contig (almost) completely, i.e., if both flanking sequences 
were shorter than 100 nt, a fragment of the same length 
was taken from a randomly selected contig coming from
the same individual but not containing predicted CRISPR 
cassettes. 

However, a straightforward application of this procedure 
would be confounded by gene homology. This would mani- 
fest as similarity extending beyond the pseudospacer- 
pseudoprotospacer pair. To avoid this possibility, we 
extracted the flanking sequences of selected pseudospacers. 
These sequences had the same length as the repeats in the 
respective cassette. They were run against the same datasets 
as pseudospacers. A pseudospacer-pseudoprotospacer pair 
was taken into account only if none of its pseudospacer- 
flanking sequences matched the same sequence. 
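
The pseudospacer sampling described in the two paragraphs above can be sketched as follows. This is an illustration under stated assumptions, not the procedure actually used: the function name, the argument layout, and the requirement that the fragment together with both repeat-length flanks fits into a single cassette-free region are all choices made for this example.

```python
import random

def draw_pseudospacer(contig, cassette_start, cassette_end, spacer_len,
                      repeat_len, fallback_contig, rng, min_flank=100):
    """Pick a random fragment of length `spacer_len` outside the cassette.

    The cassette occupies contig[cassette_start:cassette_end]. If both
    flanks are shorter than `min_flank`, the fragment is drawn from
    `fallback_contig` (a cassette-free contig of the same individual).
    Returns (pseudospacer, left_flank, right_flank) or None; the flanks
    have the repeat length and serve for the homology check.
    """
    left_len = cassette_start
    right_len = len(contig) - cassette_end
    if left_len < min_flank and right_len < min_flank:
        source, allowed = fallback_contig, [(0, len(fallback_contig))]
    else:
        source = contig
        allowed = [(0, cassette_start), (cassette_end, len(contig))]
    need = spacer_len + 2 * repeat_len
    # Start positions where the fragment plus both flanks fit in one region.
    starts = [s for lo, hi in allowed for s in range(lo, hi - need + 1)]
    if not starts:
        return None
    s = rng.choice(starts) + repeat_len
    e = s + spacer_len
    return source[s:e], source[s - repeat_len:s], source[e:e + repeat_len]
```

A pseudospacer hit would then be kept only if neither returned flank matches the same target sequence.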

Repeat clusters were constructed as described previ- 
ously [26] using the standard BLASTCLUST procedure 
applied to the set of consensus repeat sequences (param- 
eters: -L 0.5 -S 50 -e F -p F -W 15). All clusters with 
more than one member were collected. Alignments for 
the obtained clusters were constructed using the stand- 
ard MUSCLE procedure [55]. For further analysis, the 
repeats were considered to be similar if they belonged to 
the same repeat cluster. 

To search for PAM sequences (protospacer adjacent 
motifs), 10 nt regions flanking protospacers from both 
sides were used [56]. 

CRISPR cassettes were oriented, when possible, accord- 
ing to the position and direction of transcription of cas 
genes. An end of a cassette was labeled as the leader 
terminus if the adjacent region of the contig contained a 
cas gene in the proper orientation. In addition, cassettes 
lacking flanking sequences of sufficient length were ori- 
ented by comparison to cassettes from the same repeat 
cluster, for which the orientation had been already 
assigned, assuming that the repeat should be encoded on 
the same strand for the entire cluster. For cassettes not be- 
longing to clusters with defined orientation, the leader and 
trailer termini could not be determined, and such cassettes 
were not considered in the orientation-dependent analyses. 

Targeting spacers were defined as spacers having at 
least one reliable protospacer in the same individual 
metagenome. Shared spacers were defined as spacers ob- 
served in two or more individual metagenomes. 

To estimate whether targeting spacers tend to occur
close to the leader end of cassettes, and shared spacers
close to the trailer end, the following Monte Carlo simulation
was implemented. Only complete cassettes (with
non-cassette flanking sequences) with defined orientation
were used. Spacers in each cassette were enumerated.
For each cassette, the serial numbers of all targeting (resp.
shared) spacers in the cassette were summed. Then
the obtained sums were added up over the whole
set of considered CRISPR cassettes. Hence, we obtained
a single value equal to the sum of all serial numbers of
all targeting (resp. shared) spacers. After that, spacers in
each cassette were randomly shuffled and the same procedure
was applied. This was repeated 100,000 times and
the distribution of the analyzed statistic was built for the
targeting and shared spacers. Then the statistic obtained
for real cassettes was compared with the constructed
distributions, and the p-values were calculated.
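
The simulation can be sketched as below. The numbering of spacers from the leader end starting at 1, the one-sided definitions of the two p-values, and all names are assumptions of this example; the text above does not fix these details.

```python
import random

def rank_sum(cassettes):
    """Sum of 1-based positions of flagged spacers over all cassettes.

    Each cassette is a list of booleans ordered from the leader end;
    True marks a spacer of the class of interest (targeting or shared).
    """
    return sum(i for c in cassettes for i, flagged in enumerate(c, 1) if flagged)

def position_test(cassettes, n_iter=100_000, seed=0):
    """Return (observed, p_leader, p_trailer) from within-cassette shuffling."""
    rng = random.Random(seed)
    observed = rank_sum(cassettes)
    low = high = 0
    for _ in range(n_iter):
        # Shuffle spacers independently inside every cassette.
        shuffled = [rng.sample(c, len(c)) for c in cassettes]
        s = rank_sum(shuffled)
        low += s <= observed   # at least as close to the leader end
        high += s >= observed  # at least as close to the trailer end
    return observed, low / n_iter, high / n_iter
```

A small `p_leader` would indicate clustering near the leader end (expected for targeting spacers), and a small `p_trailer` clustering near the trailer end (expected for shared spacers).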

To check whether spacers and protospacers tend to co-occur,
we performed the following test. For each individual
we constructed a 2 x 2 contingency table featuring
the number of spacer-protospacer pairs with spacer (resp.
protospacer) coming from this individual. For the resulting
set of contingency tables, the Cochran-Mantel-Haenszel
(CMH) statistic was calculated [57]. As a null
hypothesis we assumed that the occurrences of a spacer and
the corresponding protospacer in an individual are independent.
To check whether our data fit this hypothesis, we
shuffled protospacers across the individual metagenomes,
so that the number of protospacers in a given individual
remained unchanged. This procedure was performed
10,000 times, and in each round of permutations the
CMH statistic was calculated. The obtained distribution
was used to estimate the p-value of the observed statistic.
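
For reference, the CMH statistic for a set of 2 x 2 tables and an empirical p-value from permuted statistics can be computed as sketched below. This is a generic textbook formulation written for illustration; the function names are hypothetical, the continuity correction is optional, and the construction of the per-individual tables from spacer-protospacer pairs is not reproduced here.

```python
def cmh_statistic(tables, correction=True):
    """Cochran-Mantel-Haenszel chi-square for a list of 2x2 tables.

    Each table is ((a, b), (c, d)). Strata with fewer than two
    observations carry no information and are skipped.
    """
    diff = 0.0
    var = 0.0
    for (a, b), (c, d) in tables:
        n = a + b + c + d
        if n < 2:
            continue
        row1, row2 = a + b, c + d
        col1, col2 = a + c, b + d
        diff += a - row1 * col1 / n  # observed minus expected count
        var += row1 * row2 * col1 * col2 / (n * n * (n - 1))
    if var == 0:
        return 0.0
    num = abs(diff) - (0.5 if correction else 0.0)
    return max(num, 0.0) ** 2 / var

def empirical_p(observed, permuted):
    """Fraction of permuted statistics at least as large as the observed one."""
    return sum(s >= observed for s in permuted) / len(permuted)
```

In a permutation setting, `permuted` would hold the CMH values obtained after each round of shuffling protospacers across individuals.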

Results and discussion 

Characteristics of CRISPR cassette sets 

We used three publicly available human metagenomes to 
search for CRISPR cassettes. The latter were identified by 
several existing programs (see Methods). To exclude false 
predictions of CRISPR cassettes in the metagenomic data, 
a filtering procedure was applied [26]. This procedure re- 
tains the following types of cassettes: (1) cassettes predicted 
by the CRT, CRISPRFinder and PILER-CR programs sim- 
ultaneously; (2) candidate cassettes (cassettes predicted by 
only one or two programs listed) adjacent to cas genes; (3) 
candidate cassettes whose repeat consensus is similar to 
the repeat consensus of a cassette already accepted based 
on (1) or (2). 
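
The three acceptance rules can be expressed compactly in code. The sketch below is an illustration only: the dictionary layout of a candidate and the `similar` callback (standing in for the repeat-cluster comparison) are assumptions of this example, not the implementation from [26].

```python
def filter_cassettes(candidates, similar):
    """Apply the three acceptance rules to candidate cassettes.

    Each candidate is a dict with keys:
      "programs": set of program names that predicted it,
      "near_cas": True if the cassette is adjacent to cas genes,
      "repeat":   consensus repeat sequence.
    `similar(r1, r2)` decides whether two repeat consensuses are similar.
    """
    all_programs = {"CRT", "CRISPRFinder", "PILER-CR"}
    accepted, pending = [], []
    for c in candidates:
        if c["programs"] >= all_programs or (c["programs"] and c["near_cas"]):
            accepted.append(c)  # rules (1) and (2)
        else:
            pending.append(c)
    # Rule (3): rescue candidates whose repeat resembles a repeat of a
    # cassette accepted by rule (1) or (2) only.
    seed_repeats = [c["repeat"] for c in accepted]
    for c in pending:
        if any(similar(c["repeat"], r) for r in seed_repeats):
            accepted.append(c)
    return accepted
```

Collecting `seed_repeats` before the second loop ensures that a cassette rescued by rule (3) cannot itself rescue further candidates.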

The sets of identified cassettes are shown in Figure 1 
and characterized in Table 2 and Additional file 1: Table
S1. The largest set of cassettes was identified in the JPN
dataset, followed by HMP, and few cassettes were ob- 
served in the DG metagenome. Among the algorithms, 
the largest number of candidate cassettes was produced 
by CRISPRFinder, followed by CRT and PILER-CR, with 
considerable overlap between the predictions (Figure 1). 
Examination of individual predictions demonstrated that 
CRT and PILER-CR tend to consider genomic repeats and 
low-complexity regions as a candidate CRISPR cassette, 
whereas CRISPRFinder reports numerous short cassettes 
of the type "repeat-spacer-repeat". 

Then, we considered candidate CRISPR cassettes identi- 
fied by at least one of the programs, and adjacent to cas 
genes (Table 2). There are several possible reasons why 
these cassettes have not been identified initially, including 
short cassette lengths, varying length of spacers, divergent
repeats, etc., confounding individual programs. Notably, of 
132 cassettes adjacent to cas genes, 119 (90%) had repeats 
not observed in other databases or complete genomes. 
After this step, no candidate cassettes with repeats similar to
the repeats of already accepted cassettes were observed.
Hence, filtering condition (3) turned out to be redundant,
which supports the robustness of the procedure.

The final set of CRISPR cassettes consisted of 296, 78,
and 14 cassettes from the JPN, HMP, and DG metagen- 
omes, respectively. Two distinct cassettes were never 
observed in one contig. In all three metagenomic data- 
sets, a considerable fraction of cassettes were adjacent to 
putative cas genes. We detected 70, 56, and 6 such cas- 
settes in the JPN, HMP, and DG metagenomes, respect- 
ively, comprising 24%, 71%, and 43% of the total cassette 
number in the respective set. 

The set of 296 CRISPR cassettes in the JPN metagenome
contained 3410 spacers, comprising 2992 unique
spacers. The 378 spacers from 78 HMP cassettes comprised
352 unique ones. Only one spacer out of 175 spacers
found in 14 DG cassettes occurred twice (Table 2). The
non-redundant set of repeat sequences contained 170, 74,
and 11 unique repeats for the JPN, HMP, and DG metagenomes,
comprising 139 repeat clusters (Table 2).

Once a reliable set of CRISPR cassettes was constructed, 
we compared consensus repeat sequences from these cassettes
(Additional file 1: Table S1) with repeats from already
known CRISPR cassettes deposited in CRISPRdb [58]. Only 
23 of 255 identified unique repeat sequences matched repeats 
from the CRISPRdb database. All matched repeats originated 
from the JPN metagenome and corresponded to 17 repeat 
clusters. Such a small intersection with CRISPRdb indicates 
that most CRISPR cassettes identified here are novel. 

Generally, two different approaches to identification of
CRISPR cassettes in metagenomic data are feasible: making
predictions on assembled contigs or extracting spacers
directly from raw reads. CRISPR prediction on assembled
contigs retains the order of spacers in a cassette. On the
other hand, assembly of sequences containing repeats is
difficult, and hence a considerable fraction of CRISPR
cassette-containing reads would remain unexplored. The
other approach, recently used to identify CRISPR cassettes
in the human gut metagenome [39], analyzes raw reads
and extracts spacer sequences flanked by sufficiently long
repeat segments from already known CRISPR cassettes
from existing databases. Here, the set of identifiable spacers
is limited by the set of known repeats. A combination of
the described strategies, named "targeted assembly of
CRISPR cassettes", first selects reads matched by known
repeat sequences or predicted by a program (CRT), and
then reassembles them into CRISPR cassettes [38]. We
compared our results with those produced by these two
approaches on the human gut metagenomic data.

Table 2 Statistics of identified CRISPR cassettes and spacers

| Metagenome dataset | JPN | HMP | DG |
|---|---|---|---|
| Cassettes | | | |
| Identified by PILER-CR | 322 | 121 | 17 |
| Identified by CRT | 359 | 149 | 21 |
| Identified by CRISPRFinder | 361 | 235 | 22 |
| Identified by all three programs | 272 | 45 | 13 |
| Identified by 1 or 2 programs, but adjacent to cas genes | 24 | 33 | 1 |
| Final set of cassettes: total number | 296 | 78 | 14 |
| Cassettes adjacent to cas genes | 70 (24%) | 56 (71%) | 6 (43%) |
| Cassettes with assigned taxonomy | 73 (25%) | 69 (82%) | 9 (64%) |
| Cassettes with assigned CRISPR-cas type: I type | 18 (6%) | 16 (20%) | 1 (7%) |
| Cassettes with assigned CRISPR-cas type: II type | 9 (3%) | 4 (5%) | 1 (7%) |
| Cassettes with assigned CRISPR-cas type: III type | 6 (2%) | 18 (23%) | 1 (7%) |
| Spacers | | | |
| Total number | 3410 | 378 | 175 |
| Unique spacers | 2992 | 352 | 174 |
| Spacers with protospacers in the same metagenomic dataset | 136 | 59 | 0 |
| Spacers with protospacers in the NR database | 17 | 9 | 0 |
| Repeats | | | |
| Unique | 170 | 74 | 11 |
| Repeats with matches in CRISPRdb | 23 | 0 | 0 |
| Repeats from known clusters according to the CRISPRmap algorithm | 122 | 18 | |

Columns correspond to three metagenome datasets.

The CRISPR set identified by Stern et al. [39] in raw
HMP reads contains 52,267 spacers, 48,484 of which are 
unique. Comparing these spacers with the spacer sets 
identified here, we found only 15 matches in our set of the 
HMP spacers (originating in four different cassettes; only 
three spacers from the Stern set exactly matched spacers 
from the HMP set) and 125 matches in the JPN set (ori- 
ginating in 40 different cassettes). No matches with 
spacers from the DG set were observed. The matched 
spacers comprise 3% and 4% of our unique spacers in the 
HMP and JPN sets, respectively, i.e., roughly the same 
fractions of the whole sets of unique spacers. 



The fact that only a few of the spacers from Stern et al.
[39] matched spacers identified here could have
two explanations: either these spacers or the respective cassettes
were present in the HMP assembled contigs, but
had been missed by our identification procedure, or the 
reads containing these spacers were not assembled in 
the contigs. To distinguish between these alternatives, 
we performed BLASTN search for unique spacers from the 
Stern set against all assembled HMP contigs. Most spacers 
(39,273, 81%) did not match contigs. The remaining spacers 
and the matching contigs were analyzed in more detail. As 
repeats identified by Stern et al. were not available, we 
checked whether HMP contigs matched by the Stern 
spacers contained repeats from the HMP repeat set (iden- 
tified here) and/or known CRISPR repeats from CRISPRdb 
[58]. With the exception of the contigs with three spacers 
that exactly matched HMP spacers identified here and 
twelve spacers with non-exact matches (up to four mis- 
matches), none of contigs matched by the Stern spacers 
contained repeats either from CRISPRdb or from our 
HMP repeat set. Further, six spacers from the Stern set
matched HMP contigs with questionable (in the CRISPRFinder
notation) CRISPR cassettes, which had been predicted
by only one or two algorithms due to a low level of
repeat conservation and varying spacer lengths within a candidate
cassette. This strongly suggests that we did not miss
any considerable number of identifiable cassettes with
spacers from the Stern set, and hence that the false negative rate of
our predictions, given the available data, is low. On the other
hand, a considerable fraction of the Stern spacers (9,100,
19%) matched HMP contigs without repeats or CRISPR
cassettes predicted by any of the algorithms. Given that the
Stern et al. [39] procedure relies on known repeats,
these matches could be protospacers of those spacers.

The set of CRISPR cassettes identified in the HMP data 
by the targeted assembly approach contained 150 cas- 
settes, 86 of which were found in gut samples [38] and we 
used the latter for further analysis. Comparison of these 
data with the repeat sequences identified here yielded only 
four matches with repeats from the HMP set, originating 
in four different CRISPR cassettes. Only 25 of the HMP repeats
identified by Rho et al. [38] had matches in the assembled
HMP contigs. These contigs did not pass our
filters as they did not contain CRISPR cassettes identified
by the three programs due, in particular, to short cassette
length or degenerate repeats; nor did they contain cas
genes. As in the previous case, this shows that the Rho cassettes
absent from our set have been produced by reads that
had not been assembled into contigs.

Unfortunately, we could not make a universal comparison 
as the necessary data were not available - only spacers were 
provided in [39] and only repeats, in [38]. Still, this analysis 
shows that CRISPR cassettes identified by our approach are 
mostly novel and considerably different from CRISPR cas- 
settes found in human gut microbiomes earlier, and hence 
the contig-based and read-based approaches produce
complementary results. Indeed, the read-based approach
missed cassettes with new repeats, while the contig-based 
techniques missed cassette fragments in unassembled reads. 

The number of CRISPR cassettes in individual meta- 
genomes varied. No dependence between the number 
of identified cassettes or spacers and the average con- 
tig size or sample size could be observed (Additional 
file 2: Figure S1); however, the sequencing technologies
and assembly algorithms could be responsible for
the observed differences between the datasets. On the 
other hand, the number of identifiable CRISPR cassettes 
might reflect the major taxonomic breakdown of human 
gut microbiota and, indirectly, enterotypes of particular 
individuals [59]. 

Taxonomy of metagenomic contigs containing 
CRISPR cassettes 

To define the taxonomic origin of contigs containing the 
identified cassettes, a BLASTX-based procedure was used 
(for details see Methods). The short length of metagenomic 
contigs combined with the propensity of cas genes to horizontal
gene transfer makes taxonomic predictions for
CRISPR-containing contigs difficult. We assigned the tax- 
onomy at least at the domain level to 73 of 296 JPN cas- 
settes (25%), 69 of 78 HMP cassettes (82%), and 9 of 14 DG 
cassettes (64%). The differences in the average fractions of 
CRISPR-containing contigs with assigned taxonomy reflect 
the average lengths of the contigs in the respective samples. 

Despite the fact that the total number of cassettes iden- 
tified in the three studied metagenomes was considerably 
different, the prevalent taxa of contigs with CRISPR cas- 
settes were similar in all three datasets (Figure 2). The lar- 
gest fraction of contigs with assigned taxonomy belonged 
to Firmicutes. We observed 33, 43, and 8 contigs of
Firmicutes origin in the JPN, HMP, and DG metagenomes,
respectively, with the majority of them belonging to Bacilli
and Clostridia. In the JPN sample, 20 contigs were observed
in adults and 13 in children.

Figure 2 Taxonomy of CRISPR-containing contigs. The presented metagenomes are: JPN (A); JPN, children only (B); HMP (C); DG (D). The abbreviation "nd" stands for "not determined".

The second major group in the JPN metagenomic data- 
set was comprised of 19 contigs (5 in adults and 14 in chil- 
dren) from Actinobacteria with a majority of them 
assigned to the Bifidobacterium genus. Contigs belonging 
to Proteobacteria and Bacteroidetes/Chlorobii group com- 
prised about 2% each. Proteobacterial contigs were pre- 
dominantly coming from Enterobacteriaceae, in particular, 
from Escherichia coli. For about 25% of JPN contigs, no
strong evidence of any particular bacterial phylum was detected,
so these contigs were generically assigned to Bacteria
(Figure 2A, B). According to a previous analysis of the
JPN metagenome [43], the major constituents in adults 
and children are different. CRISPR-containing metage- 
nomic contigs originating from children were assigned to 
the same predominant taxa as the whole dataset, but the 
fraction of contigs assigned to Actinobacteria was larger 
(13%) (Figure 2B). 

As mentioned above, the largest fraction of HMP CRISPR- 
containing contigs (56%) was assigned to Firmicutes 
(Figure 2C). 22% of HMP contigs showed a clear bacterial 
origin but could not be assigned to any particular phylum, 
while for 16% of contigs, the taxonomic origin could not be 
determined. A minority of CRISPR-containing contigs in 
the HMP metagenome originated in Actinobacteria, Bacter- 
oidetes, and Proteobacteria, together comprising less than 
6%. In the DG metagenomic dataset, one cassette was 
assigned to Bacteroidetes and one, to Actinobacteria 
(Figure 2D). No archaeal contigs containing CRISPR cas- 
settes were observed in any human gut metagenome. 

The taxonomic breakdown of the CRISPR-containing contigs
differs slightly from that based on 16S rRNA for the entire
metagenomic datasets. According to 16S rRNA, the major
constituents of the JPN metagenome in adults and weaned
children were always Bacteroidetes, followed by several
Firmicutes genera and the genus Bifidobacterium. In the case
of infants, representatives of Bifidobacteriaceae and
Enterobacteriaceae were predominant [43]. This does not
agree with the prevalence of Firmicutes CRISPR-containing
contigs in the JPN metagenome. In HMP, the major fraction
of the microbial diversity, according to 16S rRNA, consists
of Firmicutes, followed by almost equal fractions assigned
to Bacteroidetes, Actinobacteria and Proteobacteria, i.e., it
is very similar to the taxonomic diversity of
CRISPR-containing contigs predicted in the HMP metagenome.
In the DG metagenome, the majority of 16S rRNA sequences
were assigned to Firmicutes and a smaller number to
Actinobacteria [45]. This breakdown generally matches the
taxonomic breakdown of CRISPR-containing DG contigs with one
exception: according to our data, one CRISPR-containing
contig was assigned to Bacteroidetes. Probable reasons for
these discrepancies are biases in the estimation of taxon
abundance due to variability in the 16S rRNA gene copy
number, which ranges from 1 to 15 per bacterial genome [60],
or varying CRISPR prevalence in different bacterial phyla.
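
The copy-number bias mentioned above can be illustrated with a small
worked example. The sketch below is not part of our pipeline; the taxon
names, read counts and copy numbers are hypothetical values chosen only
to show how dividing 16S read counts by the per-genome copy number can
change the apparent ranking of taxa.

```python
def copy_number_corrected_fractions(read_counts, copy_numbers):
    """Convert 16S read counts to genome-level fractions.

    Each read count is divided by the 16S rRNA gene copy number of the
    taxon, and the results are normalised to sum to 1.
    """
    genomes = {t: read_counts[t] / copy_numbers[t] for t in read_counts}
    total = sum(genomes.values())
    return {t: g / total for t, g in genomes.items()}


# Hypothetical numbers: taxon_A carries 7 copies per genome, taxon_B one.
reads = {"taxon_A": 700, "taxon_B": 300}
copies = {"taxon_A": 7, "taxon_B": 1}

raw = {t: n / sum(reads.values()) for t, n in reads.items()}
corrected = copy_number_corrected_fractions(reads, copies)
print(raw)        # taxon_A dominates by read count
print(corrected)  # taxon_B dominates after correction
```

In this toy case the taxon that accounts for 70% of the reads
corresponds to only 25% of the genomes once the copy number is taken
into account.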

CRISPR-cas types 

Functional CRISPR-cas immune systems consist of CRISPR
cassettes and cas genes [14]. We classified the identified
systems according to repeat types and associated cas genes
where possible. The latter were found in flanking sequences
of 130 cassettes. In a large fraction of flanking sequences
(50, 38%) the only identified cas gene was cas1 (Additional
file 2: Figure S2), a universal marker of all CRISPR-cas
systems, hence not applicable for differentiating system
types. Among cassettes that could be classified according to
characteristic cas genes, 34 were assigned to CRISPR-cas
type I; 25 cassettes, to CRISPR-cas type III; and 14
cassettes, to CRISPR-cas type II. For 29 cassettes, the
composition of associated cas genes was sufficient to assign
subtypes (Additional file 2: Figure S2, Additional file 1:
Table S1).

CRISPR repeats may be divided into types based on se- 
quence similarity and ability to form stable secondary struc- 
tures [61,62]. The repeat type is associated with certain cas 
genes, and hence the repeat sequence itself can be used as 
a classifying feature. Recently, an automated classifier of 
CRISPR repeats - CRISPRmap - was published [63]. 
CRISPRmap was designed for comprehensive classification 
of all known (i.e., publicly available) CRISPR cassettes based 
entirely on the repeat properties (sequence and secondary 
structure). Applying CRISPRmap to CRISPR repeats identi- 
fied in the human gut microbiomes resulted in assignment 
of 191 unique repeat sequences, corresponding to 233
cassettes, to one of six superclasses (labeled A-F)
(Additional file 1: Table S1). Representatives of all six
superclasses were found. Superclasses F, E and D appeared to
be the most populated ones. Of note, these superclasses
contain families with little sequence conservation [63].
Repeats from 160 cassettes were not assigned any superclass
label by the CRISPRmap classification; however, for 50 of
these, a CRISPR-cas type could be determined according to the
composition of the associated cas locus [62].
This suggests that these come from previously unknown 
CRISPR cassettes. 

For 83 cassettes both classification labels could be
assigned. For 17 cassettes the CRISPR-cas type assignments
did not match the repeat-based classification (Additional
file 3: Table S2). Contradictions were observed for only 
three repeat superclasses (F, C and D). This may indicate 
that the existing correspondence between cas-gene com- 
position and repeat types should be revised. 

Identification of protospacers 

We compared each of the three non-redundant spacer sets
with the metagenome it originated from. In the JPN
non-redundant set, we identified 240 reliable
spacer-protospacer pairs (Additional file 4: Table S3,
Additional file 5: Figure S3). The observed protospacers
corresponded to 136 different spacers (~5% of the JPN
non-redundant set) and originated from 165 unique
metagenomic contigs. For two metagenomic contigs
(HumanGut_CONTIG_00179657 and HumanGut_CONTIG_00179696)
the number of protospacers was remarkably high (16 and 10,
respectively). These contigs clearly demonstrated a
bacteriophage origin and were similar to Bacillus phi29-like
phages. For one spacer from the JPN spacer set, we detected
19 protospacers, all of them coming from contig regions
corresponding to bifidobacterial transposase genes. For 35
(15%) of non-redundant HMP spacers, we identified 89
spacer-protospacer pairs with protospacers coming from 59
different contigs. For the DG spacer set, no reliable
spacer-protospacer pairs in the same metagenomic dataset
were observed (Additional file 4: Table S3).

The comparison of the spacers with the NR collection of
GenBank yielded 75 spacer-protospacer pairs for the JPN
spacer set, corresponding to 17 spacers (Table 2). Among
these spacer-protospacer pairs, eleven matches were found in
complete viral genomes. The detected protospacers
corresponded to four different spacers and were located in
genomes of phages infecting Escherichia, Salmonella,
Clostridium spp. and three unspecified enterobacteria
(VT2-Sakai, epsilon15, Sf6). Notably, one of these spacers
had protospacers, all with four mismatches, in five
different enterobacterial phages: Enterobacteria phage
VT2-Sakai, Enterobacteria phage Sf6, Escherichia Stx1
converting bacteriophage, Stx2 converting phage II, and
Salmonella enterica bacteriophage SE1. This protospacer
corresponded to the most conserved region in the lambda
phage Ea22 protein gene, occurring in all five
enterobacterial phages. This may reflect a close
evolutionary relationship of these sequences with the real
source of the spacer, and this spacer might mediate
CRISPR-dependent multiphage resistance against a group of
enterobacterial bacteriophages (Figure 3).
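
The mismatch count quoted above can be checked directly against the
sequences printed in Figure 3A. The following is a minimal sketch that
compares the spacer with the protospacer sequence shared by the five
phages, position by position and without gaps; the function name is
ours and is used only for illustration, and this is not the search
procedure described in Methods.

```python
def mismatch_positions(spacer, protospacer):
    """Return 1-based positions at which two equal-length sequences differ."""
    if len(spacer) != len(protospacer):
        raise ValueError("sequences must have equal length")
    return [
        i
        for i, (a, b) in enumerate(zip(spacer, protospacer), start=1)
        if a != b
    ]


# Sequences copied from Figure 3A (32 nt each).
SPACER = "GCGCGCAGTGCCTGATAATCAATTTTGCTCAT"
PROTOSPACER = "TCACGCAGTGCCTGATAGTCAATCTTGCTCAT"

positions = mismatch_positions(SPACER, PROTOSPACER)
print(len(positions), positions)  # 4 [1, 3, 18, 24]
```

Two of the four mismatches fall at the very start of the spacer, and
the remaining 28 positions contain only two substitutions.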

Seven spacers coming from contigs with the same repeat
sequence matched protospacers in two unclassified phages:
six resided in the unidentified phage clone
2204_scaffold14 (JQ680368.1) and one in the unidentified
phage clone 2209_scaffold1451 (JQ680376.1). Both phages
were isolated from human gut samples. Four protospacers
originating from those unclassified phages occurred in
coding sequences assigned to hypothetical proteins, while
three protospacers were located in intergenic regions.

Three spacers matched protospacers in the complete
genome of Bifidobacterium longum subsp. infantis 157F.
Remarkably, one of these protospacers was located in a 
gene encoding a putative phage tail protein (the ruler 
protein), i.e., it originated from a prophage sequence. The 
remaining two bifidobacterial protospacers originated from 
genes encoding conserved hypothetical proteins. Of note, 
the bifidobacterial protospacers corresponded to spacers 
from two CRISPR cassettes assigned to children. 

The remaining three spacers for the JPN set matched
protospacers of enterobacterial origin, residing in plasmids
from Escherichia coli, Salmonella enterica and Klebsiella
pneumoniae. One of the enterobacterial protospacers matched
a plasmid gene encoding replication protein A from the
repFIB replicon; another one matched a gene encoding a
putative antirestriction protein, and the remaining
protospacer corresponded to an intergenic region of various
E. coli plasmids (pHUSEC41-1, pUMNK88_91, pND12_96, pECOED,
pEK204, pEC_Bactec, pO113). Summing up, the majority of JPN
protospacers found in GenBank sequences were of clear viral
or plasmid origin.

Only nine protospacers, corresponding to nine spacers,
were found in the NR collection for the HMP spacer set
(Table 2). Of five protospacers matching regions in the
genome of Faecalibacterium prausnitzii, two resided in
intergenic regions, two corresponded to a gene annotated
as the growth inhibitor protein, and the remaining
protospacer matched the CDS of a hypothetical protein. Three
protospacers for HMP spacers matched sequences of
uncultured organisms, clones LM0ABA27ZF12FM1 and
VC1A546TR, both isolated from human gut samples.
Finally, the remaining protospacer corresponded to a
hypothetical protein-coding gene in Bifidobacterium longum
subsp. longum.

A
Stx1_converting_phage     TCACGCAGTGCCTGATAGTCAATCTTGCTCAT 32
VT2-Sakai_79              TCACGCAGTGCCTGATAGTCAATCTTGCTCAT 32
Stx2_converting_phage_II  TCACGCAGTGCCTGATAGTCAATCTTGCTCAT 32
phage_SE1                 TCACGCAGTGCCTGATAGTCAATCTTGCTCAT 32
phage_Sf6                 TCACGCAGTGCCTGATAGTCAATCTTGCTCAT 32
spacer                    GCGCGCAGTGCCTGATAATCAATTTTGCTCAT 32

B
Stx2_converting_phage_II  MSKIDYQALREKAEKATKG--SYIVGHTSVNQHGNLTGVFVCQKW- 43
VT2-Sakai_79              MKRHMSKIDYQALREKAEKATKG--SYIVGHTSVNQHGNLTGVFVCQKW- 47
Stx1_converting_phage     MSKIDYQALREKAEKATKG--SYIVGHTSVNQHGNLTGVFVCQKW- 43
phage_SE1                 MSKIDYQALREAAEKATCGEWSLEYGEERFDAGDALIHREVVGYLP 46
phage_Sf6                 MSKIDYQALREAAEKATCGVWSLEYGEGRFDGDDALIHREAAGYIP 46

Figure 3 The position of protospacer corresponding to the most conserved part of a protein related to the lambda phage Ea22
protein in six related enterophages. (A) Nucleotide sequence alignment for spacer and identified protospacers. (B) Amino acid sequence
alignment for the respective protospacers. The protospacer position is shown by the black frame.

No reliable protospacers were associated with the DG
spacer set, either in the same metagenomic dataset
(Additional file 4: Table S3, Additional file 5: Figure S3)
or among viral sequences from GenBank. The observed low
number of matches with complete and partial phage genomes
(deposited in GenBank) probably reflects the fact that most
of the viral space remains unexplored.

In order to estimate the significance of detected proto- 
spacers, we performed a similar search against the NR col- 
lection for simulated pseudospacers (see Methods). For 
2,992 spacers simulating the JPN spacer set, we detected 
66 hits (mainly originating from multiple E. coli strains)
corresponding to ten pseudospacer-pseudoprotospacer 
pairs. Unlike the real JPN spacers with identified protospa- 
cers, JPN pseudoprotospacers were mainly found in 
complete genomes of various bacterial taxa, corresponded 
to intergenic regions, and showed no tendency to match 
mobile elements or genes associated with prophages or vi- 
ruses. Based on this simulation, we posit that protospacers 
corresponding to spacers from CRISPR cassettes are not 
random matches obtained by chance and are indeed asso- 
ciated with defence against mobile elements such as vi- 
ruses and plasmids. Similar results were obtained for the 
HMP and DG spacer sets (data not shown). 

Although short protospacer adjacent motifs (PAMs) 
are considered to be a common feature of many diverse 
CRISPR systems [58,64], we could not detect any re- 
liable PAM motifs for the protospacers clustered by the 
repeats. 

Taxonomy of protospacer origin and compatibility with 
the CRISPR-cassette taxonomy 

The taxonomy of CRISPR-containing metagenomic con- 
tigs can be determined relying on either flanking se- 
quences or protospacers. When a metagenomic contig 
was assigned with both types of taxonomic labels, they 
were compared. 

Out of 296 CRISPR-containing contigs in the JPN
metagenome, 73 had taxonomy status assigned by the flanking
sequences, 13 contigs had taxonomic labels based on the
protospacers, and seven had both types of taxonomic labels.
Five of them demonstrated a good concordance between the
taxonomic labels, as the assignments agreed at least at the
level of phyla. For two contigs with an uncertain
flank-based assignment ("Bacteria"), the protospacer-based
taxonomy was more specific (Additional file 6: Table S4).

Out of 78 HMP contigs with CRISPR cassettes, 48 
were assigned with taxonomic labels based on the tax- 
onomy of cassette-flanking regions, and for six contigs 
the taxonomic assignment could be made according to 
protospacers. Only three contigs had both taxonomic la- 
bels, and in all cases they were in a general agreement 
(Additional file 6: Table S5). 

Similarity of the spacer composition in the human gut 
microbiomes 

A pair of adjacent spacers was observed in more than 
one metagenome, in the JPN and HMP datasets. The re- 
spective contigs overlapped by a region containing these 
two spacers and a short flanking sequence of 134 nt. Ac- 
cording to the independently assigned taxonomic labels, 
both cassettes were of the same taxonomic origin, be- 
longing to Firmicutes. 

In individual JPN metagenomes, the largest number of 
shared spacers was observed between CRISPR cassettes 
assigned to children, in particular, for the F2X-F2Y and 
F2X-INM pairs (44 and 18, respectively). In three pairs 
(INE-INB; F1T-F2W; F2W-INA), complete CRISPR cas- 
settes with flanking sequences were shared. In almost all 
cases, shared spacers originated from CRISPR cassettes 
with identical direct repeats, the only exception being 
the F1T-F2W pair having one mismatch between the re- 
peat sequences (Figure 4). 

The number of spacers occurring in more than one
individual (shared spacers) in the JPN metagenome was 78.
To check whether this is significantly more or less than
expected at random, we applied a shuffling procedure: we
randomly reassigned cassettes between individuals, so that
the number of cassettes assigned to a given individual did
not change. This procedure was applied 100,000 times, and
the number of shared spacers was calculated each time. The
average number of shared spacers was 127, and the
distribution is shown in Additional file 2: Figure S4. In
all cases the number of shared spacers was larger than the
observed one, indicating that the latter was significantly
below the random expectation (p < 10^-5).
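
The shuffling procedure described above can be sketched as a label
permutation test. The code below is an illustrative reconstruction and
not the implementation used in this study: the function names, the toy
cassettes and the add-one p-value estimate are our assumptions.
Permuting the list of owner labels keeps the number of cassettes per
individual fixed, which is the constraint stated in the text.

```python
import random
from collections import defaultdict


def count_shared_spacers(owners, cassettes):
    """Number of distinct spacers seen in more than one individual."""
    individuals_by_spacer = defaultdict(set)
    for owner, cassette in zip(owners, cassettes):
        for spacer in cassette:
            individuals_by_spacer[spacer].add(owner)
    return sum(1 for who in individuals_by_spacer.values() if len(who) > 1)


def shuffle_test(owners, cassettes, n_perm=2000, seed=0):
    """Permute cassette ownership; return (observed, null mean, lower-tail p)."""
    rng = random.Random(seed)
    observed = count_shared_spacers(owners, cassettes)
    labels = list(owners)
    null = []
    for _ in range(n_perm):
        rng.shuffle(labels)  # each individual keeps its cassette count
        null.append(count_shared_spacers(labels, cassettes))
    # Add-one estimate of how often a shuffle shares as few spacers as observed.
    p_low = (1 + sum(x <= observed for x in null)) / (n_perm + 1)
    return observed, sum(null) / n_perm, p_low


# Toy data (hypothetical): spacers repeat within, but not between, individuals.
owners = ["A", "A", "B", "B"]
cassettes = [["s1", "s2"], ["s1", "s3"], ["s4", "s5"], ["s4", "s6"]]
observed, null_mean, p_low = shuffle_test(owners, cassettes)
print(observed, round(null_mean, 2), round(p_low, 3))
```

With only four cassettes the toy example cannot reach a small p-value;
it only shows the direction of the effect, with the observed count
lying below the mean of the shuffled counts.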

To estimate the CRISPR core, we considered the distribution
of repeats among individual metagenomes. No universal or
widely distributed repeat clusters (indicating the same or
similar cassettes) were detected (Additional file 2:
Figure S5, Additional file 6: Table S6). The overwhelming
majority of repeat clusters (290) appeared to be highly
specific and associated with only one particular individual.
Still, 24 repeat clusters were shared between at least two
individual metagenomes. The most widely distributed
repeat cluster was found in five different individuals (all
from the JPN metagenomic dataset). Notably, four repeat
clusters were found in individuals coming from different
metagenomic datasets (HMP and JPN), i.e., geographically
distant populations, indicating a possibility for a common
CRISPR core in the human gut metagenomes. However,
available data were insufficient to further address this
problem.

Figure 4 Shared spacers between JPN individuals. Each square stands for an individual metagenome (male - blank square, female - square with a green
circle). The identifier of each individual, gender, age, and the numbers of predicted cassettes and spacers are specified within the respective square. Individuals with
identifiers starting with 'F' belong to families (F1 and F2); individuals with identifiers starting with 'IN' are independent. Numbers of shared spacers are shown
on the edges connecting individuals; numbers of repeating spacers in the same individual are shown on the directed edges.

Spacer-protospacer co-occurrence in individual 
metagenomes 

We then analyzed whether protospacers have a tendency 
to originate in the same individual metagenome as the 
spacers (Additional file 5: Figure S3, Additional file 4: 
Table S3). We analyzed the combined human gut set that 
contained 139 individual metagenomes. 41 (32%) of indi- 
viduals had a majority of protospacers originating in 
other individual metagenomes. For 37 (28%) of such indi- 
viduals there was one particular individual metagenome 
that contained a large fraction of protospacers. The pref- 
erence of spacers to have protospacers in the same indi- 
vidual metagenome was clearly observed in the F2Y, INB 
and INR individuals, featuring a considerable number of 
spacer-protospacer pairs (26, 49 and 14, respectively). No
cross-matching pairs between siblings (F2X and F2Y)
were observed (Additional file 4: Table S3). Surprisingly,
we identified a large number of protospacers in the JPN 
dataset that corresponded to spacers originating from the 
HMP dataset, much more than in the HMP dataset itself. 
A possible explanation comes from the HMP metagenomic
DNA purification protocol: the procedure included filtering
of a sample suspension through a 100 µm mesh nylon
membrane [43]. This procedure probably
eliminated viral particles and subsequently viral se- 
quences in the HMP metagenome, further leading to the 
relative scarcity of HMP protospacers for HMP spacers, 
compared to JPN protospacers. The fact that we observed 
many spacer-protospacer pairs in different individuals sug- 
gests that viruses associated with the human gut are to 
some extent universal, containing a fraction of ubiquitous 
viruses present in many individuals. 

To check whether spacers and protospacers tend to
co-occur, the Cochran-Mantel-Haenszel (CMH) statistic was
calculated for the real data and simulated data (see
Methods). The CMH value for the observed data was 5.18,
while the distribution of the CMH statistic for the shuffled
tables is shown in Additional file 2: Figure S6. In the
majority of cases the actual CMH statistic was larger than
that calculated for the shuffled tables; however, the
difference did not reach significance (p-value = 0.182).
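
For orientation, the textbook form of the CMH statistic for a series of
2x2 tables, together with an empirical p-value taken from shuffled
tables, can be sketched as follows. How the contingency tables are
built from the spacer-protospacer data is described in Methods and is
not reproduced here; the table layout, the function names and the
example numbers are assumptions made for illustration only.

```python
def cmh_statistic(tables, continuity=False):
    """Cochran-Mantel-Haenszel chi-square for a list of 2x2 tables.

    Each table is ((a, b), (c, d)). Strata with fewer than two
    observations carry no information and are skipped.
    """
    diff = 0.0  # sum over strata of (a - expected a)
    var = 0.0   # sum over strata of the hypergeometric variance of a
    for (a, b), (c, d) in tables:
        n = a + b + c + d
        if n < 2:
            continue
        row1, row2 = a + b, c + d
        col1, col2 = a + c, b + d
        diff += a - row1 * col1 / n
        var += row1 * row2 * col1 * col2 / (n * n * (n - 1))
    if var == 0:
        raise ValueError("all strata are degenerate")
    numerator = abs(diff) - (0.5 if continuity else 0.0)
    return max(numerator, 0.0) ** 2 / var


def empirical_p_value(observed, shuffled):
    """Fraction of shuffled statistics at least as large as the observed one."""
    return sum(s >= observed for s in shuffled) / len(shuffled)


# Hypothetical single-stratum example.
example = cmh_statistic([((10, 5), (5, 10))])
print(round(example, 4))
print(empirical_p_value(5.18, [1.0, 6.0, 2.0, 3.0, 7.0]))
```

Under this reading, the reported p-value of 0.182 corresponds to the
fraction of shuffled tables whose statistic is at least as large as the
observed value of 5.18.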

The observed lack of clear preference of spacers and
protospacers to occur in the same individual suggests
that CRISPR systems in most individual metagenomes
are active and highly effective against bacteriophages.
A similar observation was made earlier [39]. On the
other hand, in several studies focusing on CRISPR
dynamics in human oral microbiomes [35-37], protospacers
for the respective spacers were more likely to be
identified when the oral virome of the respective
individual was also available. So, the observed scarcity of
identified protospacers for the gut CRISPR spacers may be
revised when individual gut viromes also become available
for comparison.

Compared with the spacer-protospacer co-occurrence
patterns in CRISPR systems from the ocean metagenome [26],
the CRISPR composition of the human gut seems to be more
homogeneous, as some spacers have protospacers in
geographically distant populations. This is a likely
consequence of the greater stability and uniformity of
environmental conditions in the human gut (temperature,
pH, salinity, etc.), compared to the physical and
chemical characteristics of the ocean samples. The
latter differ dramatically, and so does their CRISPR
content.
Figure 5 Positions of functionally important spacers compared
to positions of all spacers. Targeting (A) and shared (B) spacers
are presented. Spacers in each cassette were enumerated. For each
cassette, serial numbers of all targeting (resp., shared) spacers in the
cassette were summed, and then the obtained statistic was
summed over the whole set of cassettes (the red dashed line).
The plot shows the distribution of sums of serial numbers of
targeting (resp., shared) spacers for the shuffled sets of cassettes
(100,000 permutations) (see Methods).



Position of targeting spacers and shared spacers 

Positions of targeting spacers, i.e., spacers having
protospacers in the same individual, are shifted to the
leader end of a cassette (p-value < 0.0002; for details
see Methods) (Figure 5A), whereas spacers shared between
metagenomes of different individuals tend to be located
close to the trailer end of a cassette (p-value < 0.001)
(Figure 5B), and hence are likely older.
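
The positional statistic summarized in the Figure 5 legend (the sum of
serial numbers of flagged spacers over all cassettes, compared with
shuffled cassettes) can be sketched as below. This is an illustration
under stated assumptions rather than the procedure from Methods: we
assume that spacers are numbered from the leader end, that shuffling is
done within each cassette, and that a one-sided add-one p-value is
used; the toy cassettes are hypothetical.

```python
import random


def rank_sum(cassettes):
    """Sum of 1-based serial numbers of flagged spacers over all cassettes."""
    return sum(
        i
        for flags in cassettes
        for i, flagged in enumerate(flags, start=1)
        if flagged
    )


def position_test(cassettes, n_perm=2000, seed=0):
    """Shuffle flags within each cassette; return (observed sum, lower-tail p)."""
    rng = random.Random(seed)
    observed = rank_sum(cassettes)
    hits = 0
    for _ in range(n_perm):
        # random.sample with k = len(flags) returns a shuffled copy.
        shuffled = [rng.sample(flags, len(flags)) for flags in cassettes]
        if rank_sum(shuffled) <= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)


# Toy cassettes: True marks a targeting spacer, position 1 is the leader end.
toy = [
    [True, True, False, False, False, False],
    [True, False, False, False, False],
    [True, True, False, False, False, False, False],
]
observed_sum, p_value = position_test(toy)
print(observed_sum, p_value)
```

A small sum means that flagged spacers sit near the leader end; for
shared spacers the same test would be run on the upper tail, since they
are expected near the trailer end.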

These observations agree with earlier reports about 
selected, experimentally studied systems. Indeed, in
response to phage infection, Streptococcus thermophilus
alters its CRISPR loci by adding new spacer-repeat
units to the leader end of the cassettes [18,19].
Further, reconstruction of CRISPR loci of an extremo- 
philic archaeon, I-plasma, showed that the leader-end 
spacers are highly diverse while the trailer-end spacers 
tend to be clonal population-wide [65]. The observed 
clonality of the trailer ends could be caused by conse- 
cutive selective sweeps that would homogenize CRISPR 
composition in a population, followed by differential 
addition of new spacers at the leader end. 



Conclusions 

We analyzed CRISPR systems in the metagenomic datasets 
of three human gut microbiomes. A detailed comparison 
with other studies [38,39] demonstrated complementarity 
of read-based and contig-based approaches — while many 
spacers remained in unassembled reads, our contig-based 
procedure could identify cassettes with new types of 
repeats and characterize the evolutionary dynamics of 
spacers within cassettes. 

In all three metagenomes, the largest fraction of contigs
with CRISPR cassettes was assigned to Firmicutes.
Comparison of the identified spacers with the GenBank
NR collection and complete viral genomes yielded only
a few matches, revealing that the viral space remains
largely unexplored. In contrast, we found an appreciable
number of spacers with protospacers originating in the same
metagenomic datasets. Based on the implemented statistical
analysis, we could not reject the hypothesis that the
observed co-occurrence of spacers and their protospacers was
generated by chance; moreover, a considerable fraction of
spacers had protospacers originating in other gut
metagenomic datasets.
On the other hand, the overlap in spacers between
different human gut metagenomic samples was negligible.
Here we encounter an apparent paradox: a low spacer
similarity between individuals versus a considerable
number of spacer-protospacer pairs originating in different
individuals. The relative scarcity of spacer-protospacer
pairs in the same individual may result from high efficiency
of CRISPR systems against viruses, i.e., fast elimination of
the respective bacteriophage from the environment [39].

At the same time, the occurrence of protospacers in
other individuals suggests that there exists a common viral
core inhabiting the human gut, and that this core may be
shared between geographically distant populations. The
extent to which CRISPR systems have acquired resistance
against these common viruses, i.e., the presence of spacers,
may vary between individuals.

In comparison with CRISPRs in the ocean metagenome
[26], the human gut appeared to be a more homogeneous
environment, as some spacers have protospacers in
geographically distant populations. A possible explanation
comes from the uniformity of environmental conditions
in the human gut compared with different ocean samples
(temperature, pH, salinity, etc.). A broad range of physical
and chemical conditions in the ocean results in variability
of the microbial and phage composition and, consequently,
differences in the CRISPR content.

The contig-based approach allowed us to reconstruct
the order of spacers in CRISPR cassettes. Targeting
spacers tend to be located closer to the leader end. As this
is the site of addition of new spacers [18,19], this indicates
a footprint of recent bacteriophage infections. Conversely,
spacers shared between individual metagenomes tend to
be located closer to the trailer end of the cassettes, and
hence represent a more ancient, common state of the
CRISPR-based immunity [62,66].

Additional files 



Additional file 3: Table S2. Compatibility of the CRISPRmap 
classification of repeat sequences and the CRISPR-Cas system classification 
of adjacent cas genes. Inconsistent assignments are highlighted. 

Additional file 4: Table S3. Number of spacer-protospacer pairs per 
individual. Columns - metagenomes containing spacers, rows - metagenomes 
containing protospacers; each cell shows the number of spacer-protospacer pairs 
originating in the respective metagenomes. Background colors in the header 
columns and rows denote samples (yellow - DG, red - JPN, blue - HMP). 
Cells have green background if the respective value is non-trivial (>0) or 
grey background if the spacer and protospacer originate in the same sample. 

Additional file 5: Figure S3. Heatmap of the spacer-protospacer pairs 
distribution between individual metagenomes. Colors reflect the numbers 
of detected pairs (shown in the heatmap). 

Additional file 6: Table S4. Comparison of taxonomic labels assigned 
to CRISPR-containing contigs and respective protospacers for the JPN 
metagenome. Column 'protospacer' shows the taxonomic label of the 
contig containing the respective protospacer; column 'contig' shows the 
taxonomic label of the contig containing the respective cassette. Table S5.
Comparison of taxonomic labels assigned to CRISPR-containing contigs and 
respective protospacers for the HMP metagenome. Column 'protospacer' 
shows the taxonomic label of the contig containing the respective 
protospacer; column 'contig' shows the taxonomic label of the contig 
containing the respective cassette. Table S6. List of shared repeat clusters.
All repeats observed in at least two individuals are listed. 